ci(kai): add KaiBench evaluation workflow to CI pipeline [AI-2588] #399

Open

jordanrburger wants to merge 8 commits into main from AI-2588-kaibench-ci-evals

Conversation

@jordanrburger
Contributor

Description

Linear: AI-2588

Change Type

  • Major (breaking changes, significant new features)
  • Minor (new features, enhancements, backward compatible)
  • Patch (bug fixes, small improvements, no new features)

Summary

Adds automated KaiBench evaluations to the CI pipeline so MCP server changes are tested against the full AI agent stack before merging. This catches regressions where tool description changes, argument schema modifications, or behavior changes break the agent's ability to answer questions correctly.

How it works:

  • New kaibench.yml reusable workflow builds the MCP server Docker image from the PR branch, starts the full stack (MCP server + kai-assistant + Postgres + Redis), clones KaiBench, and runs evaluations against 3 question types (Data Analysis Query, Configuration Reasoning, Storage Object Reasoning)
  • ci.yml calls this workflow as a non-blocking job after the build passes (same-repo pushes only; see the sketch after this list)
  • Results appear as a detailed GitHub step summary with per-question pass/fail, tool call counts, token usage, and duration
  • Artifacts (summary.json + results.jsonl) uploaded with 90-day retention
  • Also supports workflow_dispatch for manual runs with configurable question types
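
For orientation, a minimal sketch of how ci.yml can wire in the reusable workflow as a non-blocking follow-up job. The build job contents, workflow paths, and the exact same-repo guard are assumptions, not the final implementation:

```yaml
# .github/workflows/ci.yml (excerpt) -- illustrative sketch only
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: echo "existing build steps"   # placeholder for the real build job

  kaibench:
    # Runs after the build and only for the canonical repo (forks lack the secrets).
    # Non-blocking: the job is not configured as a required status check for merging.
    needs: build
    if: github.repository == 'keboola/mcp-server'
    uses: ./.github/workflows/kaibench.yml
    secrets: inherit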

Before merging — secrets required:

| Secret | Purpose |
| --- | --- |
| KAIBENCH_REPO_TOKEN | GitHub PAT to clone keboola-rnd/KaiBench |
| KAIBENCH_STATIC_TOKEN | Storage API token for canary-orion project 293 |
| KAIBENCH_MANAGEMENT_TOKEN | Management API token |
| KAIBENCH_API_URL | https://connection.canary-orion.keboola.dev |
| DOCKERHUB_TOKEN | Docker Hub (for kai-assistant image) |
| KAI_GOOGLE_VERTEX_CREDENTIALS | Google Vertex AI service account JSON |
| KAI_GOOGLE_VERTEX_PROJECT | Vertex project ID |
| KAI_GOOGLE_VERTEX_LOCATION | Vertex location |
| TURBO_TOKEN | Turbo monorepo cache (for building kai-assistant from source) |
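
A hedged sketch of how kaibench.yml can declare these secrets on its workflow_call trigger so the caller can pass them through (the secret names mirror the table above; which ones end up required vs. optional is an assumption):

```yaml
# .github/workflows/kaibench.yml (excerpt) -- declaration sketch only
on:
  workflow_call:
    secrets:
      KAIBENCH_REPO_TOKEN:
        required: true
      KAIBENCH_STATIC_TOKEN:
        required: true
      KAIBENCH_MANAGEMENT_TOKEN:
        required: true
      KAIBENCH_API_URL:
        required: true
      DOCKERHUB_TOKEN:
        required: true
      KAI_GOOGLE_VERTEX_CREDENTIALS:
        required: true
      KAI_GOOGLE_VERTEX_PROJECT:
        required: true
      KAI_GOOGLE_VERTEX_LOCATION:
        required: true
      TURBO_TOKEN:
        required: false   # only needed when building kai-assistant from source
```

With the declarations in place, ci.yml can forward everything via `secrets: inherit` (as in the sketch above) or pass each secret explicitly.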

Testing

N/A for manual client testing (e.g. Cursor AI desktop over the Streamable-HTTP transport); this is a CI-only change (GitHub Actions workflows). Testing plan:

  • Add required secrets to repo settings
  • Trigger kaibench.yml via workflow_dispatch to validate that the full stack starts and the evals run (see the sketch after this list)
  • Verify GitHub Actions step summary renders the results table
  • Download artifacts and verify summary.json + results.jsonl are present
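
To make the manual trigger concrete, a sketch of the workflow_dispatch block with a configurable question-type filter; the input name and default value are assumptions, not the final interface:

```yaml
# .github/workflows/kaibench.yml (excerpt) -- manual trigger sketch
on:
  workflow_dispatch:
    inputs:
      question_types:
        description: Comma-separated KaiBench question types to run
        required: false
        default: "Data Analysis Query,Configuration Reasoning,Storage Object Reasoning"
        type: string
```

With something like this in place, a run can be started from the Actions UI or with `gh workflow run kaibench.yml -f question_types="Data Analysis Query"` (input name assumed as above).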

Checklist

  • Self-review completed
  • Unit tests added/updated (if applicable) — N/A, CI workflow only
  • Integration tests added/updated (if applicable) — N/A, CI workflow only
  • Project version bumped according to the change type (if applicable) — N/A
  • Documentation updated (if applicable)

🤖 Generated with Claude Code

Add kaibench.yml reusable workflow that builds MCP server from the PR
branch, starts the full stack (MCP server + kai-assistant + Postgres +
Redis), and runs KaiBench evaluations. Triggered as a non-blocking job
in ci.yml after the build passes (same-repo pushes only).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@linear

linear bot commented Feb 21, 2026

AI-2588 Evals in CI/CD

jordanrburger and others added 7 commits February 21, 2026 14:35
Restricts GITHUB_TOKEN to read-only contents access to satisfy CodeQL
security policy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
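
The read-only restriction corresponds to a small permissions block at the workflow (or job) level; this is standard GitHub Actions syntax:

```yaml
# Restrict the default GITHUB_TOKEN for this workflow to read-only repo contents.
permissions:
  contents: read
```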
cli.py imports requests (for requests.JSONDecodeError handler) but it
was not declared in pyproject.toml. This caused test collection to fail
on Python 3.11 in CI where no transitive dependency pulls it in.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The repo was renamed from keboola/keboola-mcp-server to keboola/mcp-server, causing the workflow's if condition on the repository name to evaluate to false.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
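
For illustration, the kind of guard this fixes; the exact expression is an assumption based on the rename described above:

```yaml
# Before (stale repo name -- never matched after the rename):
if: github.repository == 'keboola/keboola-mcp-server'
# After:
if: github.repository == 'keboola/mcp-server'
```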
Always use pre-built kai-assistant Docker image tag instead of building
from the UI repo source. This removes the need for UI repo access and
TURBO_TOKEN. The image tag is provided via vars.KAI_ASSISTANT_IMAGE_TAG.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of requiring a static image tag, the workflow now queries
Docker Hub for the latest production-kai-assi-* tag when none is
provided. This ensures CI always tests against the newest kai-assistant
without manual variable updates.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
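
A sketch of how that tag resolution could look as a workflow step. The Docker Hub repository name (keboola/kai-assistant), the variable name, and the tag prefix filter are assumptions; only the Docker Hub tags endpoint and the GITHUB_OUTPUT mechanism are standard:

```yaml
- name: Resolve latest kai-assistant image tag
  id: kai_tag
  run: |
    # Use the explicit tag when provided, otherwise ask Docker Hub for the
    # most recently updated production-kai-assi-* tag.
    # -S keeps curl error output visible even though -s suppresses progress.
    TAG='${{ vars.KAI_ASSISTANT_IMAGE_TAG }}'
    if [ -z "$TAG" ]; then
      TAG=$(curl -fsS \
        "https://hub.docker.com/v2/repositories/keboola/kai-assistant/tags?page_size=100&ordering=last_updated" \
        | jq -r '.results[].name' \
        | grep '^production-kai-assi-' \
        | head -n 1)
    fi
    echo "Resolved kai-assistant tag: $TAG"
    echo "tag=$TAG" >> "$GITHUB_OUTPUT"
```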
Remove silent curl flags to surface errors when Docker Hub API
calls fail during kai-assistant tag resolution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of running the full evaluation locally (which needs UI repo
access, TURBO_TOKEN, and registry credentials), dispatch the eval
to the KaiBench repo where all secrets are centralized.

Results are posted back as a commit status on the MCP server repo.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
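
A sketch of what the dispatch step could look like; the event type and client_payload fields are assumptions, while the repository dispatches endpoint and `gh api` field flags are standard:

```yaml
- name: Dispatch eval to KaiBench
  env:
    GH_TOKEN: ${{ secrets.KAIBENCH_REPO_TOKEN }}
  run: |
    # Fire a repository_dispatch event in keboola-rnd/KaiBench carrying the
    # commit to evaluate; KaiBench posts the result back as a commit status.
    gh api repos/keboola-rnd/KaiBench/dispatches \
      -f event_type=mcp-server-eval \
      -f "client_payload[repository]=${{ github.repository }}" \
      -f "client_payload[sha]=${{ github.sha }}"
```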